Evaluating two methods for Treebank grammar compaction
نویسندگان
چکیده
Treebanks, such as the Penn Treebank, provide a basis for the automatic creation of broad coverage grammars. In the simplest case, rules can simply be ‘read off’ the parse-annotations of the corpus, producing either a simple or probabilistic context-free grammar. Such grammars, however, can be very large, presenting problems for the subsequent computational costs of parsing under the grammar. In this paper, we explore ways by which a treebank grammar can be reduced in size or ‘compacted’, which involve the use of two kinds of technique: (i) thresholding of rules by their number of occurrences; and (ii) a method of rule-parsing, which has both probabilistic and non-probabilistic variants. Our results show that by a combined use of these two techniques, a probabilistic context-free grammar can be reduced in size by 62% without any loss in parsing performance, and by 71% to give a gain in recall, but some loss in precision.
منابع مشابه
Compacting the Penn Treebank Grammar
Treebanks, such as the Penn Treebank (PTB), offer a simple approach to obtaining a broad coverage grammar: one can simply read the grammar off the parse trees in the treebank. While such a grammar is easy to obtain, a square-root rate of growth of the rule set with corpus size suggests that the derived grammar is far from complete and that much more treebanked text would be required to obtain a...
متن کاملExperiments in Structure Preserving Grammar Compaction
Structure preserving grammar com-paction (SPC) is a simple CFG com-paction technique originally described in (van Genabith et al., 1999a, 1999b). It works by generalising category labels and in so doing plugs holes in the grammar. To date the method has been tested on small corpra only. In the present research we apply SPC to a large grammar extracted from the Penn Treebank and examine its eeec...
متن کاملTreebank vs. Xbar-based Automatic F-Structure Annotation
Manual, large scale (computational) grammar development is time consuming, expensive and requires lots of linguistic expertise. More recently, a number of alternatives based on treebank resources (such as Penn-II, Susanne, AP treebank) have been explored. The idea is to automatically “induce” or rather read off (P)CFG grammars from the parse annotated treebank resources and to use the treebank ...
متن کاملTreebank vs. Xbar-based Automatic F-structure Annotation Treebank vs. Xbar-based Automatic F-structure Annotation
Manual, large scale (computational) grammar development is time consuming, expensive and requires lots of linguistic expertise. More recently, a number of alternatives based on treebank resources (such as Penn-II, Susanne, AP treebank) have been explored. The idea is to automatically \induce" or rather read oo (P)CFG grammars from the parse annotated treebank resources and to use the treebank g...
متن کاملSemi-Automatic Generation of F-Structures from Treebanks
Statistical approaches to processing Lexical Functional Grammars (LFG-DOP, [1]) require large corpora of text annotated with c-structure and f-structure representations. To date, such corpora that exist are constructed manually or semi-automatically. Manual construction is both time-consuming and error-prone. Semi-automatic construction usually proceeds as follows: an existing LFG grammar is us...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Natural Language Engineering
دوره 5 شماره
صفحات -
تاریخ انتشار 1999